Wednesday, October 27, 2021

Nobody Understands the Relational Model: Semantics, Relational Closure and Database Correctness Part 1



 with David McGoveran 

(Title inspired by Richard Feynman)

“As currently defined, relational algebra produces anomalies when applied to non-5NF relations. Since an algebra cannot have anomalies, they should have raised a red flag that RA was not defined quite right, especially defining "relation" as a 1NF table and claiming algebraic closure because 1NF was preserved. Being restricted to tabular representation as the "language" for relationships is like being restricted to arithmetic when doing higher mathematics like differential calculus -- you need more expressive power, not less! Defining RA operations in terms of table manipulations aided initial learning and implementations by making data management look simple and VISUAL. Unfortunately, it was never grasped how much was missing, let alone how much more "intelligent" the RA and the RDBMS needed to be made to fix the problems. And I can see that those oversights were, in part, probably due to having to spend so much time correcting the ignorance in the industry."
--David McGoveran
I recently posted the following Fundamental Truth of the Week on LinkedIn, together with links to more detailed discussions of 1NF and 5NF (see References):
“According to conventional understanding of the RDM (such as it is) [and I don't mean SQL], a relation is in at least first normal form (1NF) -- it has only attributes drawn from simple domains (i.e., no "nested relations") -- the formal way of saying that a relation represents at the logical level an entity group from the conceptual level that has only individual entities -- no groups thereof -- as members. 1NF is required for decidability of the data sublanguage.

However, correctness, namely (1) system-guaranteed logical validity (i.e., query results follow provably from the database) and (2) by-design semantic consistency (of query results with the conceptual model) requires that relations are in both 1NF and fifth normal form (5NF). Formally, the only dependencies that hold in a 5NF relation are functional dependencies of non-key attributes on the PK -- for each PK value there is exactly one value of every corresponding non-key attribute value. This is the formal way of saying that a relation represents facts about a group of entities of a single type.

Therefore we now contend that database relations are BY DEFINITION in both 1NF and 5NF, otherwise all bets are off.”
It triggered a discussion that raised some fundamental issues for which an online exchange is too limiting. This post offers further clarifications, including comments by David McGoveran, on whose interpretation of the RDM (LOGIC FOR SERIOUS DATABASE FOLKS, forthcoming) I rely on. The portions of my interlocutor in the discussion are in quotes.

------------------------------------------------------------------------------------------------------------------

SUPPORT THIS SITE
DBDebunk was maintained and kept free with the proceeds from my @AllAnalitics column. The site was discontinued in 2018. The content here is not available anywhere else, so if you deem it useful, particularly if you are a regular reader, please help upkeep it by purchasing publications, or donating. On-site seminars and consulting are available.Thank you.

LATEST POSTS

- 10/09 Relational and Referential Integrity

- 09/19 TYFK: Calculated Attributes -- Redundancy, Full Normalization and Relational Theory

- 09/11 OBG: Data Warehouses Are Non-Relational Application Views

LATEST PUBLICATIONS (order from PAPERS and BOOKS pages)
- 08/19 Logical Symmetric Access, Data Sub-language, Kinds of Relations, Database Redundancy and Consistency, paper #2 in the new UNDERSTANDING THE REAL RDM series.
- 02/18 The Key to Relational Keys: A New Understanding, a new edition of paper #4 in the PRACTICAL DATABASE FOUNDATIONS series.
- 04/17 Interpretation and Representation of Database Relations, paper #1 in the new UNDERSTANDING THE REAL RDM series.
- 10/16 THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS, my latest book (reviewed by Craig Mullins, Todd Everett, Toon Koppelaars, Davide Mauri).

USING THIS SITE
- To work around Blogger limitations, the labels are mostly abbreviations or acronyms of the terms listed on the
FUNDAMENTALS page. For detailed instructions on how to understand and use the labels in conjunction with the that page, see the ABOUT page. The 2017 and 2016 posts, including earlier posts rewritten in 2017 were relabeled accordingly. As other older posts are rewritten, they will also be relabeled. For all other older posts use Blogger search.
- The links to my columns there no longer work. I moved only the 2017 columns to dbdebunk, within which only links to sources external to AllAnalytics may work or not.

SOCIAL MEDIA
I deleted my Facebook account. You can follow me:
- @DBDdebunk on Twitter: will link to new posts to this site, as well as To Laugh or Cry? and What's Wrong with This Picture? posts, and my exchanges on LinkedIn.
- @ThePostWest on Twitter where I comment on global #Antisemitism/#AntiZionism and the Arab-Israeli conflict.

------------------------------------------------------------------------------------------------------------------

“Am I correct in understanding that you're saying that a *relation* is always in 1NF and 5NF, but that doesn't mean the *database* is in 2NF?”

Well:

  • Normal forms are properties of relations, not databases. "Database in 1NF+5NF" is just a shorthand for a set of relations such that no relation therein is not in both 1NF+5NF, essentially the definition of a relational database.
  • Relations in 5NF are by definition also in 1NF, 2NF, 3NF and 4NF (why?), so how can a relational database not be in 2NF?

Note: 5NF subsumes 1NF, but we use 1NF+5NF to stress the qualitative difference between 1NF and 2-5NF:

  • two distinct ways to "bundle" entity groups into single relations:
  • <1NF has data sublanguage decidability implications;
  • 2NF-4NF have dependencies, redundancy and update anomalies implications.

 “So, in reading through:

"[In a] database relation ... [t]he only dependencies that hold ... are functional dependencies of the non-key attributes on the PK."

A couple of paragraphs later you said:This is not a relation in your definition (functional dependency), so that contradicts relational closure. What do I or you misunderstand?

"...relations can be manipulated mathematically as sets by the relational algebra (RA) with relations as results (relational closure)."

While I think this makes sense in terms of Date-style relations, I don't see how it is internally consistent. Because if we have a relation
{
( id: 1, street: "123 any lane", zip: 12345"),
( id: 2, street: "456 any lane", zip: 12345"),
( id: 3, street: "333 else drive", zip: 54321")
}
And another relation
{
( zip: 12345, city: "somewhere" ),
( zip: 54321, city: "elsewhere" )
}
And we perform a natural join, then the resulting set is obviously
{
( id: 1, street: "123 any lane", zip: 12345", city: "somewhere"),
( id: 2, street: "456 any lane", zip: 12345", city: "somewhere"),
( id: 3, street: "333 else drive", zip: 54321", city: "elsewhere")
}

Which would mean that If you have two relations (normalized), and you join them together, the result will frequently introduce new functional dependencies not based on the key, so the result might not be normalized, and therefore the result wouldn't be a relation? And so relations wouldn't be closed under joins?”

The RDM is applied theory, an adaptation of mathematical simple set theory (SST) expressible in FOPL to database management. Relations (short for mathematical relations) are abstract -- they don't represent anything in the real world. Databases, on the other hand, are formal representations of conceptual models of reality, so database relations are not abstract, they represent related entity groups and as part of the adaptation are constrained to be consistent with the models (i.e., to represent the groups faithfully), while preserving mathematical properties.

Here is, however, a critical bit of the theoretical foundation of the RDM that practically nobody in the industry is aware of: semantic correctness mandates adherence to three database design principles. One of them introduced by McGoveran, POOD, has received some attention, but not the other two. According to a McGoveran (not yet proven) conjecture, jointly the three imply a fourth -- Principle of Full Normalization (POFN) -- though not vice-versa. In other words, for RA to have semantic correctness, one necessary adjustment in Codd's adaptation of mathematical relations to database management is that database relations are not only in 1NF (as required by Codd, Date, and others), but also in 5NF. This is a formal way of saying that a database relation represents a group of entities of a single type (5NF) that has no groups as members (1NF). This, as we shall see, has implications for the definition of RA.

Note very carefully, therefore, that every database relation qualifies mathematically as a relation, but not every relation qualifies as a database relation. The ADDRESSES-ZIPS join is in 1NF but not in 5NF and, thus, does not.

“...you seem to be making a new set of semantics where "relations" are normalized (or borrowing a set of semantics from somewhere unknown to me) ... Unless you've got a new name for what TTM [Date & Darwen's THE THIRD MANIFESTO) calls relations, (let's call them "Fabians"), and then if Fabians are closed under join, then all you're doing is playing semantics, but you don't want to talk about semantics either ('The point is not "what to call it"'.)”
As already mentioned, we refer to mathematical relations as relations and to those in 1NF+5NF as database relations. TTM relations are conventional -- in 1NF, but not necessarily in 5NF, in which case they are not database relations (though they are intended to be). This "naming semantics" is an attempt to be consistent about important distinctions with important consequences, not "playing semantics".
“...your claim of update anomalies ... I think it's irrelevant. It feels like talking about update anomalies is ignoring a bigger problem in favor of a trivial one. I'll explain: Update anomalies are bugs, bugs of a type when data is not normalized, and updates to the duplicated data are not properly specified so the duplicated data isn't properly updated. But that's an entirely different class of problem than an inconsistent algebra. It's like saying "well, some people incorrectly solve the quadratic equation, that's just as big of a problem as arithmetic (an algebra) where integers isn't closed under addition, but I'm going to claim that it is". Update anomalies are not as big of a problem as an algebra where relations aren't closed under join.”
We could not agree more with the last statement. Unfortunately, update anomalies, closure, and how relational operators were defined are all interrelated and represent an even "bigger problem". Update anomalies are not "bugs", let alone irrelevant, but actually a reflection of  that much bigger problem.

(Continued in Part 2)


References

First Normal Form in Theory and Practice (series)
Simple Domains and Value Atomicity
Normalization and Further Normalization (series)
5NF, Association Relations, and Join 

Database Design: What It Is and Isn't

No comments:

Post a Comment

View My Stats